A Framework for Using Tesseract to Transcribing Early Modern Texts Having Non-standard Fonts
Abstract
Here we describe a framework built upon the Tesseract optical character recognition software for transcribing old texts set in non-standard fonts. We illustrate our software by creating a digital version of two volumes of a 17th-century French text. The volumes comprise 808 pages containing 84,366 words, and our system initially transcribes 88% of the words correctly. We also identify a methodology that would help correct an additional 1,007 words, raising recognition accuracy to 89%.
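The accuracy figures above can be checked with a few lines of arithmetic. The sketch below is only an illustration, not part of the described framework: the function name is hypothetical, and it treats the reported 88% as an exact value even though the abstract gives it rounded.

```python
# Figures taken from the abstract; 0.88 is a rounded value, so the
# result is approximate.
TOTAL_WORDS = 84_366
INITIAL_ACCURACY = 0.88
ADDITIONAL_CORRECTED = 1_007


def accuracy_after_correction(total, initial_accuracy, extra_correct):
    """Word-level accuracy once extra_correct more words are fixed.

    Hypothetical helper for illustration only.
    """
    initially_correct = initial_accuracy * total
    return (initially_correct + extra_correct) / total


improved = accuracy_after_correction(
    TOTAL_WORDS, INITIAL_ACCURACY, ADDITIONAL_CORRECTED
)
# Gain expressed in percentage points over the initial accuracy.
gain_points = (improved - INITIAL_ACCURACY) * 100

print(f"accuracy after correction: {improved:.1%}")
print(f"gain: {gain_points:.2f} percentage points")
```

Correcting 1,007 of 84,366 words adds a little over one percentage point, which is consistent with the move from 88% to 89% reported in the abstract.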
Similar resources
An Ergonomic Study of Typographic Parameters in Persian Typefaces
Introduction: The extensive development of written interactions in the current world of technology on the one hand, and the noticeable dominance of the English language in this milieu on the other, have led to inadequate utilization of Farsi in such settings, even amongst native speakers. Lack of experimental data regarding legibility and readability of the printed and electronic texts related ...
Improving Optical Character Recognition of Finnish Historical Newspapers with a Combination of Fraktur & Antiqua Models and Image Preprocessing
In this paper we describe a method for improving the optical character recognition (OCR) toolkit Tesseract for Finnish historical documents. First we create a model for Finnish Fraktur fonts. Second we test Tesseract with the created Fraktur model and Antiqua model on single images and combinations of images with different image preprocessing methods. Against commercial ABBYY FineReader toolkit...
Memory Implementations - Current Alternatives
In an attempt to ensure good-quality printouts of our technical reports, from the supplied PDF files, we process to PDF using Acrobat Distiller. We encourage our authors to use outline fonts coupled with embedding of the used subset of all fonts (in either Truetype or Type 1 formats) except for the standard Acrobat typeface families of Times, Helvetica (Arial), Courier and Symbol. In the case o...
Mathematical Font Art
Currently, only a limited number of fonts are available for high quality mathematical typesetting, such as Knuth's Computer Modern font, the STIX font, and several fonts from the TeX Gyre family. An interesting challenge is to develop tools which allow users to pick any existing favorite font and to use it for writing mathematical texts. We will present progress on this problem as part of recen...
Unsupervised Transcription of Historical Documents
We present a generative probabilistic model, inspired by historical printing processes, for transcribing images of documents from the printing press era. By jointly modeling the text of the document and the noisy (but regular) process of rendering glyphs, our unsupervised system is able to decipher font structure and more accurately transcribe images into text. Overall, our system substantially...